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1 Background 

O 

In many applications a function has to be called very often inside a loop. One such 
i-C application in numerical analysis is the finite element method (FEM) for the approx- 

imate solution of a partial differential equation (PDE). For example we would like to 
approximate the solution u : £1 — > R of the PDE 

-V • (oV«) + b ■ Vu + cu = f in O 

u = g on dQ. 



where ft C R d is a (i-dimensional domain with boundary dfl and a, c, f : O — > R, 
Q | 6 : — >• R rf and g : dQ — > R are given functions with special properties that will not be 

discussed here. In FEM a domain is discretized into a mesh by splitting the domain into 
"simple" geometric shapes (intervals, triangles, tetrahedrons, . . . ). Along with special 
\^ functions (usually piecewise polynomials) these shapes are called elements. A system of 

linear algebraic equations Ax = b is obtained by computing integrals on each element 
and sorting them into a large (and usually sparse) "system" matrix A and right hand 
side (RHS) b. 

From a computational point of view this can be achieved by writing a function 
^N) getElementlntegrals ( . . .) that computes all necessary integrals on one element and 

is then called within a loop for each element of the mesh. A corresponding Python code 
could look like this: 

• T- 1 

for el in mesh : 

elMat , elRHS = getElement Integral s ( el , a , b , c , f , g) 

# process entries of elMat and elRHS by sorting them 

# into a "big" matrix and right hand side. 
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Now imagine that the mesh is very fine, i.e. the number of elements in the mesh is large. 
For example a uniform tetrahedral discretization of the unit cube [0, l] 3 with grid size 
h = 1/n (n 6 N) results in a mesh consisting of 6n 3 tetrahedral elements. The function 
getElementlntegrals is thus called 6n 3 times. 

Especially in interpreted programming languages like MATLAB, Octave or Python a 
function call may be very time-consuming. By function call we mean the setup needed 
for the function to start executing the actual function code in the function's body and 
cleaning it up afterwards. This includes possible copying of memory and dynamic type 
checking for the parameters passed to and returned from the function. 

In the above setting the functions a, b, c, f , g and even getElementlntegrals can 
often be evaluated in a fast way. However, the function call itself may exhibit an over- 
head that consumes far more time than the actual function code. In this report we 
present results of function call benchmarks for programming languages or interpreters 
often used in numerical analysis: MATLAB/ Octave, Python and C. While C is a com- 
piled language and optimizations are possible even on a very low level, MATLAB and 
the free software alternative Octave are interpreted languages and mainly draw on auto- 
mated optimization and low-level improvements are usually only possible by switching 
to plain C, C++ or Fortran with so-called mex-files. In contrast, in the interpreted 
language Python time-critical parts can be compiled with Cython - the C-Extensions 
for Python [1]. Cython's syntax is very similar to Python's and introduces static typing 
as well as the ability to call C code easily. 

Here we present a benchmark that helps to identify and quantify optimization po- 
tentials with respect to time consumption caused by function calls in the mentioned 
languages. Section 2 describes the setup of the benchmark and Section 3 presents and 
discusses the results. 

2 Benchmark setup 

The situation outlined in Section 1 can be boiled down to a function that is called 
in a loop very often. For benchmarking the overhead of the function call itself it is 
reasonable to make the function body as simple as possible. Therefore, we will use a 
function that accepts one double precision floating-point number and returns its square. 
We ran separate tests for several types of function definitions that are available in the 
used programming languages. The different options are enumerated for later reference. 

MATLAB and Octave 

Option 1. (a) The called function defined in an external .m-file: 

function result = f un_external (a) 
result = a*a; 

end 

The loop is defined in a separate file: 
function result = loop_external (n) 
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result = zeros (n , 1 ) ; 
for i=l : n 

result(i) = f un_external ( i ) ; 

end 

end 

(b) Both functions from (a) are placed in the same file consecutively. 

(c) A nested function definition in the body of the calling function is used, 
that is the called function is placed inside the body of the loop function: 

function result = loop_nested (n) 
result = zeros(n,l); 
for i=l : n 

result(i) = f un_nest ed ( i ) ; 

end 

function ret = f un_nested (a) 
ret = a*a; 

end 

end 

(d) Anonymous function definition in the body of the calling function: 
f un_anonymous = @(a) (a*a); 

Python 

Python is an interpreted language but can be tuned by writing time-critical parts 
in Cython [1]. With Cython one can blend Python code and C code easily. We 
take a closer look at the following options for implementing the loop and the called 
function: 

Option 1. The called function can be implemented 

(a) together with the loop function in the same .py-file or 

(b) in a separate .py-file and imported in the .py-file implementing the loop. 
Option 2. The loop can be implemented with a numpy array [3] in 

(a) plain Python: 

def loop (n) : 

result = numpy . empty (n) 
for i in xrange(0,n): 

result [i] = f un_samef ile ( i +1 ) 
return result 

or in 

(b) Cython where the code still is plain Python code as in (a) or in 
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(c) Cython enriched with static typing: 

def loop (n) : 

cdef numpy . ndarray [numpy . double_t ] result = \ 

numpy . empty (n) 
cdef int i 

for i in xrange(0,n): 

result [i] = f un_samef ile ( i +1 ) 
return result 

Note that the result array and the i variable are now typed which allows 
Cython to address the elements of the numpy array in the loop efficiently. 

Option 3. Similarly, the called function can be implemented in 

(a) plain Python: 

def f un ( a) : 

return a**2 

or in 

(b) Cython where the code still is plain Python code as in (a) or in 

(c) Cython enriched with static typing: 

cpdef double fun(double a): 
return a**2 

If the function is imported in the .py-file running the loop then an ad- 
ditional .pxd-file with the corresponding function declaration should be 
provided. A .pxd-files works like a C header file and in our case simply 
contains the line 

cpdef double fun(double a) 

Several combinations are not possible and are thus omitted. For example, op- 
tion 1 (a) with option 2 (a) and option 3 (b) are impossible because both the loop 
and the called function are compiled with Cython if they are defined in the same 
file). 



Option 1. (a) The function is defined in the same .c-file as the loop and compiled with 
the options -03 -f omit-f rame -pointer. The function code is 

double fun(double a) { 
return a*a; 

} 

while the loop code is 
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double* loop(int n) { 
double* result = 

(double*) malloc (sizeof (double)*n) ; 
for (int i=0; i<n; i++) 
result [i] = fun(i); 
free (result ) ; 
return result ; 

} 

(b) The function is compiled in a shared library (.so-file) which is then dy- 
namically linked to the compiled loop function. The compiler options are 
the same as for (a). 

For further details we refer to the source code [2]. 

3 Benchmark results 

In this section we present results of the benchmark setup described in Section 2 conducted 
with the languages/interpreters 

• MATLAB 2011b 

• Octave 3.2.4 

• Python 2.7.2 

• Cython 0.14.1 (C-Extensions for Python) 

• C with GCC 4.6.1. 

All experiments have been carried out on a Intel Core i5 M540 CPU running at 2.53 GHz 
with Ubuntu 11.10. We computed a 2 for a = 1, . . . , 10 7 with all possible variations of 
implementations with the options presented in 2. By using this test setup we wish to 
identify and quantify possibilities for optimization with respect to time consumption 
caused by function calls. The experiment was repeated 10 times and the arithmetic 
mean of the measured timings are presented in Table 1. 

The files used for the experiments are published [2] under GPL3 so further results can 
be produced with later versions of the above software and on different hardware. 

Unsurprisingly, both C implementations with enabled compiler optimizations are the 
fastest implementations in this benchmark. The fact that Octave is slower with function 
calls than MATLAB is also well-known. More interesting is the observation that Python 
and MATLAB approximately consume the same amount of time when no optimizations 
are used in Python. However, we can see in the first lines of Table 1 that Python can 
be tuned with the C-Extensions Cython such that the execution time reaches the one of 
the dynamically linked C implementation, which is about 40 times faster than the plain 
Python or MATLAB implementation. 
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Table 1: Execution time in seconds for 10 function calls. The variants are described in 
Section 2. 

The possibility to write performance-critical parts in the Python-like Cython syntax 
is a clear advantage over MATLAB and Octave because currently an optimization of 
function calls in MATLAB can only be achieved by writing .mex- files that require a 
complete rewrite of the code in another language and are often hard to handle - especially 
if several versions of MATLAB and thus the mex- API are used. However, we want to 
point out that in principle the performance of the C variants can be achieved with 
mex-based implementations by calling C code in the mex-files. 

The Cython approach requires less effort since the Cython syntax is very similar to 
the plain Python syntax. Thus optimizations can be implemented easily with Cython 
where they are needed while maintaining the full flexibility of Python. 
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